import numpy as np
import pandas as pd
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
df = pd.read_csv("chess.csv")
# Map the boolean flag to readable labels in one step (avoids writing strings into a boolean column)
df["rated"] = df["rated"].map({True: "Rated", False: "Unrated"})
df
The following graph shows a higher win rate for White, indicating a slight advantage. This is consistent with the theory of "first-move advantage" in chess: because White moves first, White often sets the tone of the game and more commonly chooses an aggressive approach.
totalRated = df[df['rated'] == 'Rated'].shape[0]
ratedWinner = df[df['rated']=='Rated'].groupby('winner').size()/totalRated *100
totalUnrated = df[df['rated'] == 'Unrated'].shape[0]
unratedWinner = df[df['rated']=='Unrated'].groupby('winner').size()/totalUnrated *100
ratedWinner = ratedWinner.reset_index()
ratedWinner['rated'] = 'Rated'
unratedWinner = unratedWinner.reset_index()
unratedWinner['rated'] = 'Unrated'
winners = pd.concat([ratedWinner,unratedWinner])
winners = winners.rename(columns={0:'count'})
fig = px.histogram(
    winners, x="rated", y="count", color="winner", barmode="group",
    title="Percentage of games by winner, rated vs. unrated",
    color_discrete_map={"white": "#f5cc5b", "draw": "orange", "black": "#260f05"},
    labels={"winner": "Winner"},  # previously assigned outside the call, so it had no effect
)
fig.layout["xaxis"]["title"] = "Game type"
fig.layout["yaxis"]["title"] = "Percentage"
fig.show()
One might expect unrated games to outnumber rated ones, as is common in many online games. Interestingly, Lichess users gravitate towards rated games rather than unrated ones.
A high percentage of resignations suggests that the average player tends to give up once the position becomes too difficult to hold.
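The percentages plotted in this notebook are shares within each group (rated or unrated), not shares of all games. As a minimal sketch of that computation, `pd.crosstab` with `normalize="index"` gives the same row-wise shares in one call. The `sample` frame below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up sample: four rated games and two unrated games
sample = pd.DataFrame({
    "rated": ["Rated", "Rated", "Rated", "Rated", "Unrated", "Unrated"],
    "victory_status": ["resign", "resign", "mate", "outoftime", "resign", "mate"],
})

# normalize="index" divides each row by its own total, so every row sums to 100
shares = pd.crosstab(sample["rated"], sample["victory_status"], normalize="index") * 100
print(shares.round(1))
```

Calling `shares.reset_index().melt(id_vars="rated", value_name="percentage")` would turn this table into the long format that `px.histogram` or `px.bar` expects.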
# Work on copies so that columns added later do not trigger SettingWithCopyWarning
rated = df.loc[df["rated"] == 'Rated'].copy()
unrated = df.loc[df["rated"] == 'Unrated'].copy()
totalRated = df[df['rated'] == 'Rated'].shape[0]
ratedWinner = df[df['rated']=='Rated'].groupby('victory_status').size()/totalRated *100
totalUnrated = df[df['rated'] == 'Unrated'].shape[0]
unratedWinner = df[df['rated']=='Unrated'].groupby('victory_status').size()/totalUnrated *100
ratedWinner = ratedWinner.reset_index()
ratedWinner['rated'] = 'Rated'
unratedWinner = unratedWinner.reset_index()
unratedWinner['rated'] = 'Unrated'
winners = pd.concat([ratedWinner,unratedWinner])
winners = winners.rename(columns={0:'count'})
fig = px.histogram(
    winners, x="rated", y="count", color="victory_status", barmode="group",
    title="Percentage of games by victory status, rated vs. unrated",
    labels={"victory_status": "Victory status"},  # previously assigned outside the call, so it had no effect
)
fig.layout["xaxis"]["title"] = "Game type"
fig.layout["yaxis"]["title"] = "Percentage"
fig.show()
Elo (the Elo rating system) is a way of rating players within a game system. A player's rating is updated after each game according to the result and the opponent's rating. An upset is defined as a game in which the lower-rated player beats the higher-rated player. By dividing the data into brackets based on the difference between the two players' ratings, we can see that the larger the difference, the less likely an upset is. Of course, given how the matchmaking algorithm works, players with a large Elo difference are less likely to be matched together, so the higher brackets contain fewer games and the plot is less precise there.
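To give the observed upset rates a theoretical reference point, here is a minimal sketch of the standard Elo expected-score formula with the usual scale of 400. It is an assumption that this formula is a fair approximation for this dataset, since the site's own rating system may differ. Note also that the expected score counts a draw as half a point, so it is not exactly a win probability. The function name `expected_score` is illustrative.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Expected score of the lower-rated player as the rating gap grows
for gap in (0, 100, 200, 400, 800):
    print(gap, round(expected_score(1500, 1500 + gap), 3))
```

The value falls monotonically as the gap grows, which matches the downward trend of the upset curve.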
# An upset is a decisive game won by the lower-rated player.
# Draws and games between equally rated players are not counted as upsets.
white_upset = (rated["winner"] == "white") & (rated["white_rating"] < rated["black_rating"])
black_upset = (rated["winner"] == "black") & (rated["black_rating"] < rated["white_rating"])
rated["upset"] = white_upset | black_upset
rated['elo_difference'] = abs(rated['white_rating'] - rated['black_rating'])
rated['elo_interval'] = pd.cut(rated['elo_difference'],[0,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300], include_lowest=True )
newDf = rated[rated['upset'] == True].groupby("elo_interval").size().reset_index().rename(columns={0: "count"})
tmp = rated.groupby("elo_interval").size().reset_index().rename(columns={0: "total"}).set_index('elo_interval')
newDf = newDf.set_index('elo_interval')
newDf = pd.concat([newDf,tmp], axis=1)
newDf['percentage'] = newDf['count']/newDf['total'] *100
newDf.index = newDf.index.astype(str)
fig = px.line(newDf, y='percentage', title="Upset percentage by Elo difference bracket")
fig.layout["xaxis"]["title"] = "Elo Difference"
fig.layout["yaxis"]["title"] = "Upset Percentage"
fig.show()
In chess, an opening is a strategy, or a set of strategies, used in the initial stage of the game. There are many types of opening, but it is useful to divide them into two categories: aggressive and defensive. A player using an aggressive opening attempts to set the pace of the game while shutting down the opponent's attempts to regain control. A player using a defensive opening attempts to slow the game down and respond to the opponent's moves. As mentioned before, White tends to prefer aggressive openings, since White always moves first.
The following graph reports the 10 most used openings in rated games. The results are partly expected and partly surprising. The Sicilian Defense is a very popular opening, so it is not surprising to find it in first place. That said, it is a defensive opening. Since Black tends to prefer defensive openings, unlike White, we can surmise that either Black players use this opening very often, or White players also tend to favour the Sicilian Defense, despite it going "against the norm".
rated['opening_type'] = rated['opening_name'].str.split(':').str.get(0)
total_games = rated.shape[0]
openings = rated.groupby('opening_type').size().reset_index(name='count').sort_values(by='count',ascending=False)
openings.loc[~openings['opening_type'].isin(openings.head(10)['opening_type']), 'opening_type'] = 'Others'
fig = px.pie(openings, values='count', names='opening_type', title='Most used opening types')
fig.show()
We ran the same analysis on a smaller dataset, this time composed of games involving top-50 players. That is, at least one of the two players in each game was among the 50 highest-rated players by Elo.
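Ranking players requires one current rating per player, although each player appears in both the white and the black columns. A minimal sketch of one way to get it is to stack both colours into a long table and keep the rating from each player's most recent game. The helper name `latest_ratings` and the sample data are hypothetical; the column names are the ones used in this notebook.

```python
import pandas as pd


def latest_ratings(games: pd.DataFrame) -> pd.Series:
    """Return each player's rating in their most recent game, highest first."""
    white = games[["white_id", "white_rating", "last_move_at"]].rename(
        columns={"white_id": "player", "white_rating": "rating"})
    black = games[["black_id", "black_rating", "last_move_at"]].rename(
        columns={"black_id": "player", "black_rating": "rating"})
    # One row per (player, game), regardless of colour
    long = pd.concat([white, black], ignore_index=True)
    # After sorting by time, the last row of each group is the most recent game
    latest = long.sort_values("last_move_at").groupby("player")["rating"].last()
    return latest.sort_values(ascending=False)


# Made-up sample with three players and three games
sample = pd.DataFrame({
    "white_id": ["ann", "bob", "ann"],
    "black_id": ["bob", "cat", "cat"],
    "white_rating": [1500, 1600, 1550],
    "black_rating": [1580, 1400, 1420],
    "last_move_at": [1, 2, 3],
})
top_players = latest_ratings(sample)
print(top_players)
```

With the real data, `latest_ratings(rated).head(50).index` would give the player ids to filter on with `isin`.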
top_white = rated
top_black = rated
# sort_values returns a new frame, so the result must be assigned back
top_white = top_white.sort_values(['white_id','last_move_at'], ascending=False)
top_white = top_white.groupby("white_id").first()
top_black = top_black.sort_values(['black_id','last_move_at'], ascending=False)
top_black = top_black.groupby("black_id").first()
# Build the frame from all four Series at once so its index is the union of all player ids
# (assigning column by column would drop players who only ever played Black)
top = pd.DataFrame({
    'white_date': top_white['last_move_at'],
    'black_date': top_black['last_move_at'],
    'white_rating': top_white['white_rating'],
    'black_rating': top_black['black_rating'],
}).fillna(0)
# Keep the rating from whichever colour the player used most recently
top['last_rating'] = top['black_rating']
mask = top['white_date'] > top['black_date']
top.loc[mask, 'last_rating'] = top.loc[mask, 'white_rating']
top = top.sort_values('last_rating', ascending=False)
top50 = rated[(rated['white_id'].isin(top.head(50).index)) | (rated['black_id'].isin(top.head(50).index))]
total_games = top50.shape[0]
top_openings = top50.groupby('opening_type').size().reset_index(name='count').sort_values(by='count',ascending=False)
top_openings.loc[~top_openings['opening_type'].isin(top_openings.head(10)['opening_type']), 'opening_type'] = 'Others'
px.pie(top_openings, values='count', names='opening_type', title='Most used opening types - games including at least one player in the top 50')
A few interesting points:
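# fill_value=0 keeps the totals below from turning into NaN when an opening lacks one outcome
opening_victory = rated[rated['opening_type'].isin(openings['opening_type'])].groupby(['opening_type', 'winner']).size().unstack(fill_value=0)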
opening_victory['total'] = opening_victory['black'] + opening_victory['white'] + opening_victory['draw']
opening_victory['white'] = (opening_victory['white'] / opening_victory['total'])*100
opening_victory['black'] = (opening_victory['black'] / opening_victory['total'])*100
opening_victory['draw'] = (opening_victory['draw'] / opening_victory['total'])*100
opening_victory = opening_victory.sort_values(by="white")
fig = px.bar(opening_victory, y=opening_victory.index, x=["white","draw", "black"], title="Winners grouped by the 10 most used openings", labels={
"variable": "Winner"},
color_discrete_map={
"white": "#f5cc5b", "draw": "orange", "black":"#260f05"
}
)
fig.layout["xaxis"]["title"] = "Percentage of victories"
fig.layout["yaxis"]["title"] = "Opening Type"
fig.show()
rated['elo_interval'] = pd.cut(rated['elo_difference'],[0,200,400,600,800,1000,1300], include_lowest=True )
newDf = rated.sort_values(by='elo_interval')
newDf['elo_interval'] = newDf['elo_interval'].astype(str)
fig = px.box(newDf, x='elo_interval', y="turns", title="Number of moves per game by Elo difference")
fig.layout["xaxis"]["title"] = "Elo Difference"
fig.layout["yaxis"]["title"] = "Moves"
fig.show()
The following set of graphs shows an interesting relation. The rook is the least used piece in the opening stage, but its usage rises sharply afterwards. Given the frequency of castling (a move in which the king moves two squares towards a rook and that rook jumps to the square the king crossed), we can assume that the rook is mostly kept back to protect the king and, partly because of its starting position, is rarely used in the opening stage.
Standard chess notation:
Castling involves the king and a rook. It serves to protect the king further while allowing the rook to take a more active role in the game. Castling is only allowed if neither the king nor the rook involved has moved.
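The piece counts in this section rely on the first character of each move token: `K`, `Q`, `B`, `N` and `R` name a piece, `O` marks castling (`O-O` or `O-O-O`), and anything else is a pawn move. As a minimal sketch, assuming the `moves` column holds space-separated tokens in this notation, the tokens can be flattened and labelled as follows. The names `PIECE_BY_PREFIX` and `label_moves` and the sample games are illustrative.

```python
import pandas as pd

# First character of a move token -> piece; pawn moves start with a file letter
PIECE_BY_PREFIX = {
    "K": "King", "Q": "Queen", "B": "Bishop",
    "N": "Knight", "R": "Rook", "O": "Castling",
}


def label_moves(moves: pd.Series) -> pd.DataFrame:
    """Flatten space-separated move strings into one row per move with its piece."""
    # explode() keeps game boundaries intact: each token stays a separate element
    tokens = moves.str.split().explode().rename("move").reset_index(drop=True)
    piece = tokens.str[0].map(PIECE_BY_PREFIX).fillna("Pawn")
    return pd.DataFrame({"move": tokens, "piece": piece})


# Two short made-up games
sample_moves = pd.Series(["e4 e5 Nf3 Nc6 Bb5 a6 O-O", "d4 d5 Qd3 Kd7"])
labelled = label_moves(sample_moves)
piece_counts = labelled["piece"].value_counts()
print(piece_counts)
```

Because castling moves both the king and a rook, a castling count can be added to both pieces afterwards, as the analysis here does.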
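# Join with a space before splitting: summing the strings directly would fuse the last move
# of each game with the first move of the next one (may take some time)
all_moves = " ".join(rated.moves).split()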
all_moves_df = pd.DataFrame(all_moves)
all_moves_df = all_moves_df.rename(columns={0: "move"})
most_used_moves = pd.DataFrame(all_moves).groupby(0).size().reset_index(name='count').set_index(0).sort_values(by='count', ascending=False)
fig = px.bar(most_used_moves.head(10), x='count', title="Most frequently used moves")
fig.layout["xaxis"]["title"] = "Count"
fig.layout["yaxis"]["title"] = "Move"
fig.show()
def addPiece(df):
    # Label each move with the piece that made it; moves without a piece letter are pawn moves
    df['piece'] = 'Pawn'
    df.loc[df.move.str.startswith('K'), ['piece']] = 'King'
    df.loc[df.move.str.startswith('Q'), ['piece']] = 'Queen'
    df.loc[df.move.str.startswith('B'), ['piece']] = 'Bishop'
    df.loc[df.move.str.startswith('N'), ['piece']] = 'Knight'
    df.loc[df.move.str.startswith('R'), ['piece']] = 'Rook'
    df.loc[df.move.str.startswith('O'), ['piece']] = 'Castling'
Interestingly, the order shown in the graph follows the conventional order of the pieces' values (Queen 9, Rook 5, Bishop 3, Knight 3, Pawn 1), except for the king and the rook, whose counts are skewed by castling.
addPiece(all_moves_df)
data = all_moves_df.groupby('piece').size().reset_index(name='count')
total_moves = len(all_moves)
data = data.set_index('piece')
castlings = data.loc['Castling']['count']
kings = data.loc['King']['count']
rooks = data.loc['Rook']['count']
data.loc[data.index == 'King',['count']] = kings + castlings
data.loc[data.index == 'Rook',['count']] = rooks + castlings
data = data.drop('Castling').sort_values(by='count')
fig = px.bar(data, y=(data['count']/total_moves) *100, title="Most frequently used pieces")
fig.layout["xaxis"]["title"] = "Piece"
fig.layout["yaxis"]["title"] = "Percentage"
fig.show()
split_moves = pd.DataFrame(rated.moves.str.split()) #may take some time
split_moves['turns'] = rated['turns']
split_moves['opening_ply'] = rated['opening_ply']
opening_moves = split_moves[['moves' ,'opening_ply']].apply(lambda x: x['moves'][:x['opening_ply']], axis=1)
opening_moves = pd.DataFrame(opening_moves)
opening_moves = opening_moves.rename(columns={0: "moves"})
opening_moves = opening_moves.moves.sum()
all_opening_moves = pd.DataFrame(opening_moves)
all_opening_moves = all_opening_moves.rename(columns={0: "move"})
addPiece(all_opening_moves)
data = all_opening_moves.groupby('piece').size().reset_index(name='count')
data = data.set_index('piece')
castlings = data.loc['Castling']['count']
rooks = data.loc['Rook']['count']
data.loc[data.index == 'Rook',['count']] = rooks + castlings
data = data.drop('Castling').sort_values(by='count')
total_moves = len(opening_moves)
fig = px.bar(data, y=(data['count']/total_moves) *100, title="Most frequently used pieces in openings")
fig.layout["xaxis"]["title"] = "Piece"
fig.layout["yaxis"]["title"] = "Percentage"
fig.show()